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Abstract 

Background: Binding free energy and binding hot spots at protein-protein interfaces are two important research 
areas for understanding protein interactions. Computational methods have been developed previously for accurate 
prediction of binding free energy change upon mutation for interfacial residues. However, a large number of 
interrupted and unimportant atomic contacts are used in the training phase which caused accuracy loss. 

Results: This work proposes a new method, PACVasa, to predict the change of binding free energy after alanine 
mutations. PACVasa integrates accessible surface area (ASA) and our newly defined ft contacts together into an 
atomic contact vector (ACV). A ft contact between two atoms is a direct contact without being interrupted by any 
other atom between them. A contact's potential contribution to protein binding is also supposed to be inversely 
proportional to its ASA to follow the water exclusion hypothesis of binding hot spots. Tested on a dataset of 396 
alanine mutations, our method is found to be superior in classification performance to many other methods, 
including Robetta, FoldX, HotPOINT, an ACV method of contacts without ASA integration, and ACV asa methods 
(similar to PACVasa but based on distance-cutoff contacts). Based on our data analysis and results, we can draw 
conclusions that: (i) our method is powerful in the prediction of binding free energy change after alanine mutation; 

(ii) p contacts are better than distance-cutoff contacts for modeling the well-organized protein-binding interfaces; 

(iii) p contacts usually are only a small fraction number of the distance-based contacts; and (iv) water exclusion is a 
necessary condition for a residue to become a binding hot spot. 

Conclusions: PACVasa is designed using the advantages of both p contacts and water exclusion. It is an excellent 
tool to predict binding free energy changes and binding hot spots after alanine mutation. 



Background 

A binding hot spot is a small area in a protein binding 
interface whose mutation can lead to a big change in bind- 
ing free energy. The determination of its accurate location 
in the interface is a fundamental problem in structural 
biology, and is useful for applications such as rational 
drug design and protein engineering [1]. In wet labs, a 
residues contribution to binding free energy can be deter- 
mined through mutation experiments. For example, ala- 
nine scanning mutagenesis [2] mutates interfacial residues 
individually into an alanine, and then measures the change 
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of binding free energy (A AG) to quantify the contribution 
of the side chain of the mutated residue. Based on these 
wet-lab experimental outcomes and databases [3-6], it has 
been reported that binding free energy is unevenly dis- 
tributed in protein interfaces [7]. In fact, there are always 
a small fraction of interfacial residues — the binding hot 
spot— which make major contribution to the binding [7,8] 
with A AG > 2 kcal/mol [3]. But wet-lab experiments 
are both time and cost expensive. Reliable computational 
methods are thus needed for accurate prediction of bind- 
ing free energy change. 

FoldX [9,10], Robetta [11,12] and CC/PBSA [13] are 
some well-known physics-based methods for this pre- 
diction problem. These methods use empirical terms 
(such as hydrogen bonds), the van der Waals terms and 
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Coulomb electrostatics to learn a linear function for esti- 
mating the effect on the change of binding free energy 
after residue mutations. However, the predicted energy 
by these methods has a large discrepancy from exper- 
imentally measured A AG [14]. Thus, other methods 
have been proposed to qualitatively identify binding hot 
spots. For example, protein sequences are used by [15] 
and ISIS [16], while protein tertiary structures are used 
together with docking techniques by [17]. Protein qua- 
ternary structures have been also widely used [18]. For 
example, Hotsprint [19] and HotPOINT [20] generate 
rules to identify binding hot spots from features such 
as conservation, accessible surface area (ASA), residue 
propensity and/or residue pairwise potentials. Machine 
learning models are also widely used for predicting bind- 
ing hot spots. Decision trees are used in MINERVA [14] 
to induce rules at different levels of protein information 
including structure, sequence and molecular interactions. 
Later, machine learning algorithms SVM and its ensemble 
are employed to combine energetic terms such as van der 
Waals potentials, solvation energy, hydrogen bonds and 
Coulomb electrostatics, and/or other protein sequences 
and structure information for a better hot spot prediction 
performance. Recently, Bayesian Networks are used to 
combine three main sources of information related to con- 
servation, FoldX-calculated A AG and atomic contacts for 
a novel probabilistic model of binding hot spots prediction 
[21]. Very recently, random forests have been proposed to 
predict hot spots [22] by using structural neighborhood 
properties of mutated residues and other conventional 
physicochemical features [23,24]. Besides alanine muta- 
tions, hot spots after mutations to any other type of 
residues are also investigated [6] and their binding free 
energy changes can be predicted [13,25] with good per- 
formance. Several of these methods are also assessed in 
a community-wide test for predicting mutation effects on 
protein-protein interaction affinity [26] . 

In spite of intensive research, the prediction still needs 
a big improvement. The existing methods usually used 
those atomic contacts based on Voronoi diagram or sim- 
ply defined by a distance threshold with little considera- 
tion on the local atomic organization of the contacts. If 
the distance threshold is too large, e.g., larger than 6 A, 
an atomic contact between two atoms i and ; may have 
no direct contact area, because the space between i and 
/ can accommodate other atoms. Such interrupted con- 
tacts constitute a large proportion of the traditionally used 
contacts. It is highly questionable whether they are really 
important to protein binding. In fact, important contacts 
in hot spot prediction [10,11] or those closely related to 
binding hot spots [14] are generally not interrupted, such 
as hydrogen bonds, salt bridges and n — n contacts. 

To overcome these drawbacks, we propose a novel clas- 
sifier fiACV asa for predicting A AG and binding hot 



spots. The main idea of /3ACVasa is to use atomic con- 
tact vector (ACV) of /3 contacts (that's why our classi- 
fier is named /3ACV for short) instead of distance-cutoff 
contacts, contact, found on ^-skeletons [27], is our 
newly defined contact [28]. A /3 contact between two 
atoms restricts that there is no other atoms between these 
two atoms, and requires that the two atoms should have 
enough direct contact area to form an interaction. The 
definition of contacts can filter out a lot of unimportant 
and interrupted distance-cutoff contacts. Our analysis has 
found that contacts are only a small fraction number of 
those contacts based on a distance threshold [28], but they 
are effective to distinguish crystal packing from homod- 
imers [28] and to predict protein-ligand binding affinity 
[29]. 

Another important idea is that the relative ASA prop- 
erties are integrated by our /3ACV classifier based on 
the water exclusion hypothesis of binding hot spots. The 
water exclusion hypothesis states that the topological 
shape of a binding hot spot and its surrounding residues 
can be characterized as an O-ring structure [3]. Few 
residues on the O-ring, which are largely exposed to bulk 
solvent water, can contribute significantly to the protein 
binding. Thus in /3ACV, the energy contribution of a 

contact to protein binding is required to be inversely 
proportional to its ASA. 

Our PACVasa was tested on a dataset of 396 alanine 
mutations to show its superior performance. We com- 
pared /3ACVasa with the following methods: (i) ACV 
methods using distance-cutoff contacts to reveal the 
importance of ft contacts to protein binding; (ii) a ft ACV 
method without ASA integration to confirm whether 
the water exclusion theory is necessary for binding hot 
spots; and (iii) several widely-used state-of-the-art meth- 
ods such as Robetta, FoldX, HotPOINT and KFC to show 
the overall better prediction capability of P ACV asa- 

Methods 

Dataset 

The data stored in the ASEdb database [3] and the muta- 
tions in BID [4] having A AG measurements are both 
used for evaluating our method. In total, our dataset con- 
tains 22 protein-protein complexes (detailed in Additional 
file 1: Table S2). All of them have quaternary structures 
in PDB and meet the following three requirements. First, 
no redundancy exists among these protein complexes. 
Given two protein complexes (e.g., interacted pair A and 
B, and interacted pair C and D), a sequence identity is 
calculated through BLAST with the default setting for 
A and C, A and D, B and C, and B and D, denoted by 
S(A,C), S(A,D), S(B,C) and S(B,D) respectively. These 
two protein complexes are redundant if S(A, C) > 40% 
and S(B,D) > 40%, or S(A, D) > 40% and S(B, C) > 40%. 
According to this criterion, most of the protein complexes 
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in our dataset are non-redundant. For those redundant 
complexes, the mutations in the similar proteins must be 
in different positions. Our requirement on this sequence 
identity is reasonable, since atomic contacts and ASA used 
in this work are derived from complexes only, rather than 
from sequences (such as required by conservation scores). 
Secondly, only alanine mutations are considered. Thirdly, 
mutated atoms before mutation must have at least one 
distance-cutoff atomic contact with the partner proteins. 
The mutated atoms are those atoms except N, CA, C, O 
and CB. Under these requirements, our dataset has 396 
alanine mutations (detailed in Additional file 1: Table S3). 
Of these mutations, 86 are binding hot spot residues 
having A AG > 2 kcal/mol. 

Atomic ft contacts in protein binding interfaces 

Atomic /3 contact is a recently proposed notion of atomic 
contacts for modeling the well-organized protein 3D 
structures [28]. Its detail can be found in [28]. For easy 
reference, we give a brief description of P contacts and 
a simple method to produce P contacts from a protein 
quaternary structure. 

Atomic ft contacts: a definition 

Given a quaternary structure of a protein complex p, a /3 
contact between two atoms i and / in p requires that (i) 
the spatial distance between i and ; is less than a thresh- 
old Td plus the sum of their van der Waals radii defined by 
[30] (distance-cutoff contacts for short), (ii) i and ; share 
a Voronoi facet in ps Voronoi diagram, and (iii) the con- 
tact cannot break ps /3 -skeleton. The /3 -skeleton [27] of 
a discrete set p is an undirected graph in computational 
geometry. In this graph, two points i and ; have an edge 
if angle ikj is sharper than a threshold determined by p, 
V/c g p, k 7^ This angle threshold is denoted as Z/3, 
which actually defines a forbidden region/r of the contact 
between i and y, e.g., the gray regions in Figure 1. When 
P = 1, namely Zp = 90°, fir is the sphere with the mid- 
point of i and / as the center and with cs length as the 
diameter as shown in Figure 1(b). This sphere is similar to 



van der Waals radii of atoms. The forbidden region fir of 
a p contact usually does not cover any other atoms. Oth- 
erwise, if there is an atom k in/r, for example as shown 
in Figure 1(a) when Zp = 90° or in Figure 1(c) when 
Zp = 75°, the contact between i and ; is not a P contact. A 
P contact suggests that its two atoms should have enough 
direct contact area to form an important interaction. The 
number of atomic p contacts in protein binding interfaces 
is only a small fraction number of distance-based contacts 
or less than half the number of contacts in the Voronoi 
diagrams when Td = 3.3 as found by [28]. Interestingly, 
the use of p contacts can achieve better prediction perfor- 
mance for distinguishing false binding of crystal packing 
from homodimers. 

A method to produce p contacts 

A protein complex p can be represented as an atomic 
P contact graph hip), if all of the heavy atoms are rep- 
resented by nodes, and the p contacts are represented 
by edges. To produce b(p) for p, Qhull is first used to 
obtain the Delaunay triangulation [31] for all nodes. After 
that, the distance threshold Td is used to remove those 
atomic contacts whose distances are too large. Td is set as 
3.3 A (the diameter of a water molecule 2.8 A plus 0.5 A). 
This threshold is an insensitive factor to P contacts when 
it is large enough. Please refer to the Additional file 1 
for an analysis of P contacts under several different T^s. 
Thirdly, each atomic contact is checked to guarantee that 
it satisfies the p skeleton requirements. To sharpen the 
difference of those mutations with higher A AG and those 
with lower A AG, the angle threshold Zp is set as 75° in 
this work, whose forbidden region fir is larger than that of 
Zp = 90° as shown in Figure 1(b) and (c). That is, Zp = 
75° is a stricter condition than using Zp = 90° to produce 
P contacts. The rationale to choose the stricter condition 
Zp = 75° is illustrated in the following situation. Assume 
(i) A, B and C are the center points of three atoms with 
van der Waals radii 1.8 A (for example, the radius of com- 
mon Carbon atoms in protein structures), (ii) there is a 
covalent bond between A and B (AB for short) with spatial 




(a) not a (3=1 contact (b) a B=l contact (c) not a p contact (d)a p contact 
when4?=9CP when 40=75* when 40=75* 

Figure 1 Examples of ft contacts and non-/? contacts. Three points, denoted by i,j and k, represent atoms. The dashed circles represent the van 
der Waals spheres in 2D space. The lines in yellow are of interest. 
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distance 1.5 A, and a non-covalent bond between A and 
C (AC for short) and one between B and C (BC for short), 
and (iii) the van der Waals sphere of A and that of C are 
circumscribed each other with the spatial distance 3.6 A, 
and the same for the van der Waals sphere of A and that of 
C. Then, the angle ZABC = 78°. A stricter threshold than 
78° is 75°, which was chosen by this work. 

Our prediction methods 

This section describes how to construct our PACVasa 
classifier, including how to define fiACV and how to 
integrate ASA into 0ACV. 

PACV: a vector representation for interfacial alanine 
mutations 

We use an atomic contact vector (ACV) [32] of con- 
tacts to represent an interfacial alanine mutation. We also 
use P contacts of the interfacial bound water molecules to 
update the vector, and use the atomic environment of the 
mutation to expand the basic vector. 

Constructing a basic P ACV vector: To produce a basic 
/3ACV for an alanine mutation mutf a of residue r in a 
protein complex p, we build the /3 contact graph b(p) for p. 
We then remove the coordinates of the mutated atoms in r 
to get a quaternary structure resulted from the mutation, 
denoted by p™ ut . Then, we produce another /3 contact 
graph b(p? ut ) for p™ ut . 

By the mutation mutf a , some contacts in b(p) may dis- 
appear in b(p™ ut ), namely those mutated contacts, while 
some new contacts can be formed in b(p™ ut ), called new 
contacts. Both of those mutated and new contacts are 
represented in a /3ACV vector. As the heavy atoms from 
the 20 standard residues are grouped into 8 atomic types 
(shown in Additional file 1: Table S4) by this work, our 
fiACV vector has 36 pairs of atomic types as elements. 
The value v(27, Tj) of each element in ACV with atomic 
types Ti and Tj is calculated, using Equation 1. 

v(Ti,Tj)= J2 E (i) 

(x,y)eM(Ti,Tj) (x,y) (x f ,y)eN(Ti,Tj) 

where x and x' are of the atomic type 77, y and / are of 
the atomic type Tj, (x,y) and (#',/) are two atomic pairs, 
*) is the spatial distance between a pair of atoms, and 
M(Ti, 7/) (or N(Ti, Tj)) is the set of all those mutated (or 
the set of all those new) contacts whose atomic types are 
Ti and Tj. Here term <i 2 (*, *) is specially used to follow the 
same idea as Coulombs law which also uses the inverse of 
squared distance. Note that the other common contacts 
between b(p) and b(p™ ut ) are not used in ACV. Alanine 
mutations of Ala are assumed to have insignificant A AG 
and alanine mutations of Gly are not considered. 

It can be seen that a basic /3ACV considers all muta- 
ted contacts and new contacts, including both across- 



interface contacts and those contacts from the same pro- 
teins or same biological units. However, atomic contacts 
between covalent-bond nearby atoms are not used in 
Equation 1. The covalent-bond nearby atoms of a given 
atom i are those atoms that have not more than three 
covalent-bond steps from /. For example, suppose i — 
j — k — I — m, where — indicates a covalent bond. 
From i, the covalent-bond step is 0 to i, is 1 to y, is 2 to /<, 
is 3 to / and is 4 to m, respectively. Thus, k and / are 
covalent-bond nearby atoms of /, while m is not. In ft ACV, 
the contacts between i and its covalent-bond nearby 
atoms are excluded from M or N in Equation 1. This is rea- 
sonable, because spatially close distances between i and 
its covalent-bond nearby atoms are more likely due to the 
rigidity of their covalent bonds. 

Bound water molecules in protein interfaces: Pro- 
tein folding and binding occur in a solvent environment 
in vivo. Water molecules are heavily involved in protein 
binding and sometimes they can form a compulsory part 
of the protein interfaces. In this work, a water molecule 
in PDB is considered as a part of a binding interface if 

(i) it has at least 3 potential hydrogen-bonds contacts, or 

(ii) it has 2 potential hydrogen-bond contacts and also has 
at least 2 other contacts with spatial distances less than 
4 A. A potential hydrogen-bond contact is required to 
have a spatial distance less than 3.2 A between a hydrogen 
donor (such as a nitrogen atom) and a hydrogen accep- 
tor (such as an oxygen atom). Water molecules under 
this requirement, named bound water molecules, are such 
closely involved in protein folding and binding that they 
can play an integral part. Bound water molecules are then 
grouped into the Oxygen atomic type with more than one 
hydrogen atom (shown in Additional file 1: Table S4) to 
update the values of the elements in the basic ACV vec- 
tor. We did not consider the contacts between any two 
water molecules. 

The neighbourhood atoms of mutated residues: Infor- 
mation of neighbourhood atoms of mutf a is used to 
expand the basic ACV vector. Assume that S is a set 
of atoms which have /3 contacts with the mutated atoms 
under Z/3 = 90°. For each atom in S including the mutated 
atoms, its nearby atoms are added into S. Then, an atomic 
vector with the above 8 atomic types is also used to rep- 
resent those atoms in S in the bound state. The value of 
its element Tk is calculated using v(Tk) b in Equation 2. 
Similarly, v(Tk) u in Equation 2 is used to calculate another 
atomic vector for representing the atoms in S in the 
unbound state. 

v(Tk) b = J2 E i° C - v{Tk) u = E j° C M 
jeS,tj=Tk jeS,tj=Tk 

(2) 
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where Ej oc (or Ej oc (u)) is the relative local ASA of atom ; 
in the bound (or unbound) state calculated via Equation 6 
below. Water molecules were not considered here. Thus, 
each basic PACV vector is now expanded by another 16 
atomic types for representing surrounding information 
of mutated atoms. So, the expanded PACV is a vector 
representation with 52 elements. 

PACVasa: integrating the water exclusion hypothesis into 
PACV 

Solvent water is compulsory for protein binding, but 
water exclusion — small accessible surface area (ASA) — 
is a necessary condition for a residue to become bind- 
ing hot spot [3,33,34]. Few highly exposed residues can 
make significant contribution to protein binding strength 
[34]. Thus, we integrate ASA information into each 
atomic pair of /3ACV in Equation 1, and name the 
method /3 ACV asa- We note that except Equation 1, the 
other definitions in PACVasa are the same as those in 
pACV 

Given a protein complex p, we take the following steps 
to integrate the water exclusion theory into Equation 1. 
The first step is to use NACCESS [30] to produce ASA for 
all of the atoms and residues in both bound and unbound 
states. For p in the bound state, we then define special 
ASA terms for an atom i using Equation 3, and for a 
residue Ri using Equation 4 and Equation 5. 



Ei = 



h Ri ~ 



ASAj 
50 



ASA 



Ri 



^ max{ASA™) 



■nSC 

h Ri ~ 



ASA 



Ri 



max(ASAf.) 



Bi = rnax(0, 1 - E t ) (3) 



B h l = rnax(0, 1 - E%) (4) 



B% = rnax(0,l - E%) (5) 



In Equation 3, ASA[ is accessible surface area of atom 
/, while Ei is its relative ASA, and Bi is its relative ASA 
burial compared to the maximum ASA, where number 
50 is roughly half of NACCESS -calculated ASA of a sin- 
gle water molecule without any neighbor atoms (the ASA 
of a water molecule is 98.47 = 4 x 3.14 x 2.8 2 A 2 ). 
In Equations 4 and 5, ASAf and ASAf are accessible 
surface area of backbone atoms (i.e., bb) and of side- 
chain atoms (i.e., sc) for a residue Ri, while £|. and B% 
are the relative ASA and the relative ASA burial of * e 
{bb,sc}. max(ASA^.) and rnax(ASA s £.) are the maximum 
ASA of backbone atoms and of side-chain atoms for the 
residue type of Ri, which are calculated in a triplet of 
ALA-7?;-ALA by NACCESS. These backbone atoms and 
side-chain atoms are defined in the same way as those 
in [30]. 



We compute the local ASA E l ? c and local ASA burial 
B l ° c of an atom i via Equation 6. 



cloc 



pbb 
v psc 



if i is a backbone atom of Ri 
if i is a side-chain atom of R; 



B L ! 



j oc J Bi x B^ if i is a backbone atom of Ri 



Bi x B%. 



if i is a side-chain atom of Ri 



(6) 



where the multiplication of relative ASA burial of both 
atom i and its residue is used to calculate local ASA burial 
B l f c . This is because relative ASA of both an atom and 
its residue are critical in describing the accessibility of an 
atom. For example, an atom may be buried with small ASA 
but its covalent-bond atoms might be exposed. When rel- 
ative ASA of atoms or residues are used individually, the 
performance was worse (data not shown). 

To integrate water exclusion theory into Equation 1, we 
determine the value v(Ti, Tj) of each element in ft ACV asa 
through Equation 7 instead of Equation 1. 



v(Ti, Tj) = 

{x,y)eM{Ti,Tj) 



B loc x B loc 



E 

{x',y')eN{Ti,Tj) 



i 2 

nloc 



xB<° 



loc 



Vy) 



(7) 



where T*, x, y, x' /, M and N have the same meaning as 
those in Equation 1. 

Comparison of contacts with distance-cutoff 
contacts: To compare the performance of /3 contacts 
with distance-cutoff contacts for predicting A AG, 
ACV asa based on distance-cutoff contacts is constructed 
in a similar way to constructing /?ACVas,4. To further 
show the importance of ft contacts in protein binding 
interfaces, non^ACV^^ is also constructed for alanine 
mutations at the setting of Z/3 = 90°. In non^ACV^^, 
the values of its elements are the difference of the val- 
ues of the 52 elements between ACVas,4 and ^ACVas,4. 
To highlight the advantage of contacts, ACV^^^ is 
also evaluated with different spatial distance thresholds 
(from 2.9 A to 5 A) for defining atomic contacts across 
interfaces and within binding sites. 

Our ACV asa classifier and its variants described 
above are summarized in Table 1. 

Table 1 Description of fiACV A SA and its variant methods 



Methods 


Description (of the representation for an 




alanine mutation) 


£ACV 


An ACV of p contacts without ASA integration 


PACVasa 


An ACV of p contacts with ASA integration 


ACV^ 


An ACV of distance-cutoff contacts with ASA integration 


r\or\ PACVasa 


The difference of PACVasa and AO/ asa 
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Ridge regression: predict A AG and binding hot spots 

Ridge regression in Matlab is used here to learn a rela- 
tion between atomic contact vectors and A AG. By this 
regression, values in each column are normalized for 
the training dataset. Ridge regression minimizes average 
square error SE between the experimental (AAG?) and 
the predicted A AG? in the training data with N mutations 

where ^= ^ (AA g^ A ^ 2 . 

In our evaluation, leave-one-out cross-validation is used 
for all of the 396 mutations, and then the correlation coef- 
ficient R and average standard deviation 8 = +/SE are 
calculated. Under this evaluation framework, there is one 
outlier prediction by P ACV asa and one outlier by pACV 
for the whole dataset with 396 mutations. These outliers 
have less than -3 kcal/mol predicted A AG, or more than 
11 kcal/mol predicted A AG, as shown in Additional file 1: 
Table S3. This may be due to limited alanine mutations of 
a high A AG in the dataset. 

In this work, predicted hot spot residues are those 
residue mutations with a predicted A AG >2 kcal/mol, 
same as the true hot spot definition. 

Hot spot prediction and evaluation measures 

p ACV asa is also assessed by applying to the classification 
problem of binding hot spots. Classification performance 
is measured by precision(p.)> recall(r.)> accuracy (ace) and 
Fl whose definitions are given in Equation 8. 



precision(p.) = 
recall(r.) = 
accuracy (ace) = 
Fl = 



TP 



TP + FP 
TP 

TP + FN 

TP+TN 
TP+TN + FP + FN 
2 x precision x recall 
precision + recall 



(8) 



where binding hot spots are considered as the true cases, 
while non-hot spots as the false cases; TP, FP, TN and FN 
are true positives, false positives, true negatives and false 
negatives, respectively. Hence, precision is the number of 
correct hot spot predictions divided by the number of pos- 
itive predictions, recall is the fraction of correct hot spot 
predictions over all hot spots, while accuracy is the num- 
ber of correctly predicted hot spots and non-hot spots 
divided by the number of all mutations. These measures 
are also used in [14,20,35] with the same definitions. 

Results and discussion 

P contacts are better than distance-cutoff contacts for 
predicting AAG 

Our /3ACVasa classifier is compared with ACVas,4 and 
with nonpACVASA to show the importance of p con- 
tacts in the prediction of AAG under alanine mutations. 



The prediction results are presented in Figure 2(a), (b) 
and (c). 

It can be seen from Figure 2(a) and (b) that p ACV asa 
has a better AAG prediction performance according to 
both correlation coefficient R and average standard devi- 
ation 8, The number of P contacts used by P ACV asa 
is only a small fraction of the number of distance-cutoff 
contacts used by ACVas,4. For example, there are 54,286 
distance-cutoff contacts across binding interfaces for the 
22 protein complexes, but there are only 9,830 P contacts 
across the binding interfaces (Zfi = 90°), and 4,096 P 
contacts under the setting of Z/3 = 75° which is actually 
used by p ACV asa - So, PACV asa uses only 7.55% number 
of distance-cutoff atomic contacts but it achieves a better 
prediction performance. 

The comparison between p ACV 'asa and non^ACV^^ 
(Figure 2(a) and (c)) further suggests the importance of p 
contacts in AAG prediction. In Figure 2(c), non/3ACVAS,4 
has much lower R (0.469) and a higher 8 (1.486) than 
P ACV asa> but non^ACV^^ uses all non-/? contacts of 
Zfi = 90°, that is, 81.8% number of distance-cutoff atomic 
contacts. 

A lot of alanine mutations are not binding hot spot 
residues, having a small AAG, i.e., <2 kcal/mol. These 
mutations heavily affect the calculation of R and 8, On 
the other hand, the prediction of residue mutations with 
a high AAG is more important. Thus, the classification 
performance for these binding hot spots is also assessed. 
The results are shown in Table 2. It is noted that Fl is not 
the objective function to be optimized in the regression 
process. 

In Table 2 among ACV asa> ACV asa and non 
^ACVas,4, P ACV asa has the highest precision, recall, 
Fl and accuracy, while non/3ACVA£4 has the lowest. 
For example, 0 ACV asa s Fl is 0.604, 0.122 higher than 
non/* AC Vas^s Fl. However, ACV asa and non^ACV^^ 
show quite similar performances. The reason would be 
that non-/? contacts (used by non^ACV^^) are often 
dominant in distance-cutoff contacts (used by ACV asa)* 

The performances of ACV asa with different spatial dis- 
tance thresholds are shown in Table 3. We can see that the 
performance has a growing tendency when the threshold 
increases. Nevertheless, the best performance of ACV asa 
(Fl = 0.5 in Table 2) is much lower than that of P ACV asa 
(Fl - 0.604). Of special interest, when the spatial dis- 
tance threshold is set at 3.6 A, ACV asa has a number 
of distance-cutoff contacts nearly the same as the num- 
ber of our used p contacts. In this special case of almost 
the same number of contacts used, ACV asa has much 
worse performance than p ACV asa> and only about half 
of the distance-based contacts are p contacts across the 
22 protein-protein binding interfaces. These results affirm 
that p contacts are advantageous over distance-based 
contacts for predicting binding hot spot residues. 
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Experimental AAG 
(a) AAG prediction by (3ACV AS a 





Experimental AAG 

(b)AAG prediction by ACV^sa based 
on distance-cutoff contacts 





R=0.469 
5=1.486 



Experimental AAG 

(c) AAG prediction by non(3ACV A SA 
based on non-/3 contacts 




Experimental AAG 
AAG prediction by Robetta 



Experimental AAG Experimental AAG 

(d) AAG prediction by (3ACV without (e) AAG prediction by FoldX (f) 
ASA integration 

Figure 2 AAG predicted by different methods. In (a)-(d), 'o' represents non-polar residues, while Y represents polar residues. R is the Pearson 
correlation coefficient and 8 is the average standard deviation. R is specially calculated for (a) and (d) based on the data set after removing the one 
outlier prediction. R is slightly changed to 0.543 or 0.51 5 respectively when the outlier is not removed. The two diagonal red lines represent 
y = x± 1.5. 



Water exclusion is a necessary condition of hot spot 
binding 

Literature work has reported that water exclusion is a 
necessary condition for an interfacial residue to become 
a hot spot residue [3,33]. To confirm the importance of 
water exclusion in the prediction of AAG, the perfor- 
mance by /3 ACV when ASA is not integrated is assessed. 



Table 2 Prediction performances by different methods for 
the same set of binding hot spots 



Methods 


Precision 


Recall 


F1 


Accuracy 


PAC\f A5A 


0.615 


0.593 


0.604 


0.830 


ACV^ 


0.526 


0.477 


0.500 


0.793 


r\or\ PACVasa 


0.513 


0.454 


0.482 


0.788 


PACV 


0.564 


0.616 


0.589 


0.813 


FoldX 


0.400 


0.488 


0.440 


0.730 


Robetta 


0.526 


0.465 


0.494 


0.793 


HotPOINT 


0.439 


0.547 


0.487 


0.750 


KFC2a 


0.443 


0.767 


0.562 


0.740 


KFC2b 


0.521 


0.570 


0.544 


0.793 



Table 3 Prediction performance and the numbers of used 
contacts by PJKCVasa and JKCVasa 



Methods 


Distance 1 


#contacts 2 


Precision 


Recall 


F1 


Accuracy 


PWasa 




2,881 


0.615 


0.593 


0.604 


0.830 


ACV^ 


2.9 


347 


0.486 


0.419 


0.450 


0.778 




3.0 


513 


0.465 


0.382 


0.420 


0.770 




3.1 


715 


0.394 


0.302 


0.342 


0.747 




3.2 


966 


0.487 


0.442 


0.463 


0.778 




3.3 


1,293 


0.450 


0.419 


0.434 


0.763 




3.42 


1,884 


0.438 


0.372 


0.403 


0.760 




3.5 


2,394 


0.494 


0.442 


0.466 


0.780 




3.55 


2,789 


0.443 


0.407 


0.424 


0.760 




3.6 


3,123 


0.437 


0.360 


0.395 


0.760 




4 


7,542 


0.463 


0.430 


0.446 


0.768 




4.5 


15,389 


0.482 


0.465 


0.473 


0.775 




5 


26,752 


0.488 


0.465 


0.476 


0.778 



1 : The spatial distance threshold of two atoms. 

2 :The number of atomic contacts involving in the 396 mutations, including 
mutated contacts and new contacts. 
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This prediction performance is shown in Figure 2(d) 
and Table 2. Comparing Figure 2(d) with Figure 2(a), 
PACVasa has a better regression performance than 
PACV indeed. It also has a better Fl performance as 
seen in Table 2. These results confirm that water exclu- 
sion plays an important role in hot spot prediction, and it 
should be a necessary condition for an interfacial residue 
to become a binding hot spot residue in protein-protein 
complexes. 

Our method (UKCVasa is superior to several widely-used 
methods 

Our PACVasa classifier is also assessed against the state- 
of-the-art methods FoldX [9,10], Robetta [11], HotPOINT 
[20] and KFC [36]. The prediction performances of 
these previous methods were obtained through their web 
servers (Robetta, HotPOINT and KFC) or the standalone 
executable program (FoldX with default settings). 

Comparison results 

Figures 2(e) and 2(f) show that the prediction perfor- 
mance of FoldX and Robetta are much worse than our 
PACVasa- These two methods have a R of 0.324 or 0.485, 
much smaller than fiACV asa's 0.569; their 8 is 1.788 or 
1.554, much larger than PACVasa s 1-349. Table 2 also 
shows their classification performance on the 396 muta- 
tions: FoldX's Fl is 0.44, while Roberta's Fl is 0.494, both 
worse than p ACV asa's 0.604. From Table 2, our method 
also has better classification performance than Hot- 
POINT and KFC. Other performance comparison results 
are provided in the Additional file 1 when tested on BID 
(including protein-peptide interfacial residues) or under 
the leave-one-complex-out cross-validation framework. 

An example of hot spot predictions 

We use 3HFM as a case study to illustrate the difference 
of the binding hot spot prediction results by PACVasa* 



FoldX and Robetta. The 3HFM complex is an antibody- 
antigen binding between HyHEL-10 and hen egg white 
lysozyme. According to ASEdb, a total of 25 alanine muta- 
tions were experimented, 11 of which have A AG more 
than 2 kcal/mol. 

Our PACVasa correctly identified 9 binding hot spot 
residues with a recall of 0.818, but made 3 false positive 
predictions with a precision of 0.75 (Figure 3). This gives 
an Fl of 0.783. However, FoldX made only one hot spot 
prediction which is correct with a recall of 0.091, and 
Robetta has a recall of 0.455 (5 out of 11) and a precision 
of 0.833 (5 out of 6), namely an Fl of 0.588. Both of these 
methods have a lower prediction performance than our 
PACVasa* What is more important is that the four posi- 
tive predictions correctly made only by our PACVasa* not 
by FoldX or Robetta, have a high A AG, such as Trp in 
position 98 of Chain H (A AG = 5.5 kcal/mol) and Tyr in 
position 50 of Chain L (A AG = 4.6 kcal/mol). Please refer 
to Additional file 1: Table S3 for detail. 



Conclusion 

A new classifier PACVasa has been proposed to pre- 
dict A AG and binding hot spot residues. The novel idea 
of this classifier is to integrate the water exclusion the- 
ory into P contacts. Tested on a data set of 396 alanine 
mutations, PACVasa has been found to outperform many 
other methods. This confirms that P contacts are truly 
better than traditional distance-cutoff contacts to cap- 
ture the energetic characteristics of protein folding and 
binding. This also confirms that water exclusion is a nec- 
essary condition for a residue to become a binding hot 
spot residue. 

Availability of supporting data 

All the supporting data are included as additional files. 




(a) (b) 

Figure 3 Prediction results by jSACV^ for the residues in the interface between Chain Y and Chain HL(together) in 3HFM. In (a) and (b), 

the true predicted hot spot residues are in magenta, the false predicted non-hot spot residues are in yellow, the false predicted hot spot residues 
are in orange, while the true predicted non-hot spots are in cyan; X-YZZ stands for residue Y in position ZZ of Chain X. 
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Additional file 



Additional file 1 : This additional file covers an analysis on /3 contacts 
of different Tjs, more evaluation results and related discussions 
(including the statistical significance of the difference among 
Figure 2(a) to 2(d), the dataset and evaluation on BID, the evaluation 
under leave-one-complex-out cross-validation, and a discussion on 
using the 396 mutations), the groups of atomic types, and the detail 
of our dataset. 
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